ABSTRACT

The flux tensor motion flow algorithm is a versatile computer vision technique for robustly detecting moving objects in cluttered scenes. The flux tensor calculation has a high computational workload consisting of 3-D spatiotemporal filtering operations combined with 3-D weighted integration operations for estimating local averages of the flux tensor matrix trace. In order to achieve efficient real-time processing of high bandwidth video streams, a data parallel multicore algorithm was developed for the Cell Broadband Engine (Cell/B.E.) processor and evaluated in terms of the energy to computation efficiency compared to a fast sequential CPU implementation. Our multicore implementation is 12 to 40 times faster than the sequential version for HD video using a single PS-3 Cell/B.E. processor and is faster than realtime for a range of filter configurations and video frame sizes. We report on the power efficiency measured in terms of performance per watt for the Cell/B.E. implementation which is at least 50 times better than the sequential version.

Index  Terms —  Parallel  image  processing,  multicore 
Sony  Toshiba  IBM  Cell/B.E.  processor,  realtime  video/image 
processing,  3D  convolution,  power  efficient 

1.  INTRODUCTION 

Realtime persistent moving object detection and target tracking for surveillance and situational awareness applications providing MOVINT information is a computationally challenging problem. Current trends in distributed sensor networks, agile systems and autonomous intelligent systems favor the processing of large volumes of raw data closer to the sensor to reduce network transmission requirements, extract high priority scene information more rapidly and exchange integrated information for cooperative downstream processing, cross-cueing sensors, and decentralized information fusion. Energy efficiency is an important system architecture requirement for embedded processing and optimizing data management requirements using distributed platforms. Multicore parallel processing environments are widely available today, and they call for energy efficient versions of image and video processing algorithms that can dynamically adjust the number of active processors to modify the completion time based on the energy budget. Scheduling to minimize total energy across identical parallel processors has been shown to be NP-hard even for unit-sized jobs [1].

In this paper we characterize the workload of the flux tensor algorithm for moving object detection in high bandwidth video streams. The parallel flux tensor algorithm exhibits super-linear speed-up due to the vectorization, loop unrolling, short vector fused multiply add operations and double-buffering optimizations for the Cell/B.E. architecture, which along with the power efficiency of the Cell/B.E. processor provides a tremendous improvement in the performance per watt metric compared to an optimized sequential implementation. Power efficient real-time flux tensor processing is required in a variety of operational scenarios including video-based net-centric exploitation and tracking on airborne platforms [2] and ground-based multi-sensor imaging for force protection. Net-centricity provides improved agility, collaborative distributed sensing and layered sensor fusion. Such agile sensor networks need to be further enhanced to minimize overall power consumption under the constraint of still yielding the best exploitable information in a timely manner. Embedded video processing requires efficient algorithms in terms of power-aware computing as well as parallelization to enable real time performance in analyzing complex video [3,4].

There are a number of challenging computer vision problems that need to be solved for stabilizing, detecting, extracting, verifying and tracking moving objects in airborne video [5-9]. In this paper we focus on one part of the video processing pipeline, namely power-efficient realtime moving object detection that is robust to natural environmental conditions such as illumination variation, shadows, clutter, and noise. In order to reliably detect moving blobs in unconstrained video, we use the recently proposed flux tensor (J_F) operator [10, 11], which captures the temporal variations of the optical flow field within the local 3D spatiotemporal volume. The flux tensor detects only the moving structures, and is less sensitive to illumination, focus and related problems compared to other moving object detection algorithms including classical background subtraction, mixture of Gaussians and 3D structure tensor orientation estimation. The flux tensor motion detection results have, in general, better spatial coherency enabling more accurate motion-based object segmentation. The flux tensor is more efficient in comparison to the 3D grayscale structure tensor since motion information is more directly incorporated in the flux calculation which is less expensive than eigenvalue decompositions at each pixel in the image [11,12].

This paper describes a parallel implementation of the flux tensor optimized for the multicore Cell/B.E. processor for real-time processing of high-bandwidth video streams in power constrained environments. Some early supercomputing architectures like the SIMD MasPar were ideally suited for image analysis tasks like deformable motion estimation [13]. The PS-3 Cell/B.E. processor provides a modern power efficient single chip high performance computational platform, with seven heterogeneous cores: one Power Processing Element (PPE) and six (of eight) active Synergistic Processing Elements (SPEs). The PPE is a 64-bit processor that is binary-compliant with the PowerPC 970 but with a simpler architecture supporting dual issue, in-order execution. Each SPE consists of a 3.2 GHz Synergistic Processing Unit (SPU), a large 128-entry 128-bit vector register file, a small 256 Kbyte private local store memory, short pipelines, and a memory-flow controller (MFC) to access the 256 MB of shared main memory using non-blocking DMA commands at 25.6 Gbytes/s. The SPUs are in-order dual-issue statically scheduled short-vector number crunchers with support for SIMD instructions operating on packed multiple data values, without dynamic branch prediction. The PS-3 version of the Cell/B.E. processor is optimized for single-precision arithmetic (double-precision peak is less than 11 GFLOP/s) with truncation rounding. Each SPE can perform 25.6 GFLOP/s of single-precision floating point operations at 3.2 GHz. The Cell/B.E. supports both single program multiple data (SPMD) and multiple program multiple data (MPMD) parallel programming models, which are more flexible than the single instruction multiple data (SIMD) model for mapping heterogeneous multithreaded data flow execution onto SPEs.

The Cell/B.E. offered one of the first commercial implementations of a power efficient high performance single chip multiprocessor with a significant number of general-purpose programmable cores targeting a broad set of workloads [14]. A good description of scientific computing and programming on the Cell/B.E. is provided in [15] and other details of implementing scientific computing kernels and programming memory hierarchies can be found in [16, 17]. In [18], the authors discuss interesting code transformation techniques for moving scientific simulation codes to the Cell/B.E. and [19] describes the fastest Fourier transform for the Cell/B.E. processor (18.6 GFLOP/s). Programming frameworks and platforms such as RapidMind [20] and MCF (MultiCore Framework) by Mercury [21] have also emerged to support efficient programming for multicore processors. In order to reduce the complexities of task management, multithreading and synchronization for programming the Cell/B.E., some tools for mapping serial code in a semi-automatic fashion are in development [22].

We first give a brief overview of the flux tensor method and discuss the sequential implementation along with the computation and memory characteristics. Then we discuss the parallel architecture issues involved in our Cell/B.E. implementation. A description of the data partitioning scheme and parallelization procedures to map the flux tensor algorithm onto the Cell/B.E. cores is followed by experimental results of speed-up and power efficiency ratios.

2.  FLUX  TENSOR-BASED  MOTION  DETECTION 

The 3D flux tensor was shown to be a robust and computationally efficient method for coherent detection of moving regions in video [10-12]. The flux tensor is a computationally more efficient operator in comparison to the 3D grayscale structure tensor [23,24] since motion information is more directly incorporated in the flux calculation without the necessity for computing eigenvalue decompositions as with the 3D grayscale structure tensor. We summarize the mathematical description of the flux tensor multidimensional orientation estimation method to describe the types of operators needed to compute the flux tensor quantity for robust motion estimation.

In order to reliably detect moving structures without performing expensive eigenvalue decompositions, the flux tensor has been shown to be a more robust operator in comparison to the more widely used structure or orientation tensor [11, 12]. The flux tensor is composed of the temporal variations in the optical flow field within the local 3D spatiotemporal volume. Computing the temporal derivative of the optical flow equation under a constant illumination model and setting the image brightness acceleration to zero gives,

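\frac{\partial}{\partial t}\left(\frac{dI(\mathbf{x})}{dt}\right) = I_{xt} v_x + I_{yt} v_y + I_{tt} v_t    (1)

where I(x) is the spatiotemporal image volume, t is time, v(x) = [v_x, v_y, v_t] is the optic-flow vector at x, and the second derivative terms are defined as,

I_{xt} = \frac{\partial^2 I(\mathbf{x})}{\partial x \, \partial t}, \quad I_{yt} = \frac{\partial^2 I(\mathbf{x})}{\partial y \, \partial t}, \quad I_{tt} = \frac{\partial^2 I(\mathbf{x})}{\partial t \, \partial t}    (2)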
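
For readers who want the intermediate step behind Eq. 1, the expansion can be sketched as follows. The acceleration symbols a_x, a_y and a_t (the time derivatives of the flow components) are notation introduced here only for illustration and do not appear in the original text; the first line is the usual brightness-constancy form of the optical flow equation.

```latex
% Optical flow equation under constant illumination:
\frac{dI(\mathbf{x})}{dt} = I_x v_x + I_y v_y + I_t v_t = 0

% Product rule applied term by term, with a_i = \partial v_i / \partial t (illustrative notation):
\frac{\partial}{\partial t}\left(\frac{dI(\mathbf{x})}{dt}\right)
  = I_{xt} v_x + I_{yt} v_y + I_{tt} v_t
  + I_x a_x + I_y a_y + I_t a_t
```

Setting the acceleration terms a_x, a_y and a_t to zero leaves only the three second-derivative terms of Eq. 1.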

The I_xt and I_yt terms capture information about moving edges or gradients in the video while I_tt incorporates information on moving textures and temporal illumination changes. A constrained total least squares solution for the velocity field using Eq. 1 leads to the structure tensor matrix, J_F(x, W(x, y)), with matrix elements given by,


\mathbf{J}_F(\mathbf{x}, W) = \int_{\Omega} W(\mathbf{x} - \mathbf{y}) \, \frac{\partial}{\partial t}\nabla I(\mathbf{y}) \cdot \frac{\partial}{\partial t}\nabla^{T} I(\mathbf{y}) \, d\mathbf{y}    (3)

where W(x, y) is a windowed integration kernel. We use the trace of the flux tensor matrix, referred to as Tr J_F, that is defined below,

\mathrm{Tr}\,\mathbf{J}_F = \int_{\Omega} W(\mathbf{x} - \mathbf{y}) \left( I_{xt}^2(\mathbf{y}) + I_{yt}^2(\mathbf{y}) + I_{tt}^2(\mathbf{y}) \right) d\mathbf{y}    (4)

as the computational operator to reliably detect moving regions in video streams. Each term in Eq. 4 incorporates information about temporal gradients which leads to efficient filtering of moving image features. A spatially invariant integration kernel W(x - y), for multidimensional isotropic local averaging, is used for low power operation (instead of a more expensive spatially varying kernel) and is applied after the derivative filtering stages of computation in the flux tensor trace are completed. A robust statistics formulation can be incorporated in the flux tensor computation to further improve performance in low signal-to-noise conditions [25,26].

2.1.  Numerical  Computation  of  the  Flux  Tensor 

The second derivative operators needed to compute the trace of the flux tensor matrix are implemented as convolutions with appropriate kernel filters. Although general 3D convolution kernels can be used, separable kernels are preferred as the 3D convolutions then can be decomposed into a cascade of 1D convolutions with a substantial reduction in computational cost from O(n_k^3) to O(n_k) for an n_k x n_k x n_k sized filter. For numerical stability as well as noise reduction, a smoothing filter is applied along the third dimension that is not involved in the specific second derivative filter. The calculation of the first component of the trace, I_xt, uses derivative filters in the x- and t-dimensions and smoothing along the y-dimension, whereas calculation of I_yt uses smoothing along the x-dimension. The final component of the flux tensor matrix trace, I_tt, is the second derivative along the temporal direction and in this case the smoothing is applied along both spatial dimensions. The integral operator is also implemented numerically as an averaging filter decomposed into three 1D filters. The operation flow is illustrated in Figure 1. The data flow objects I_DxSy, I_SxSy and I_SxDy represent the intermediate spatial convolution results required to calculate I_xt, I_yt and I_tt. The operator modules shown in Figure 1 are the spatial smoothing filters S_x and S_y, the spatial derivative filters D_x and D_y, in the x- and y-directions respectively, and the temporal derivative operators D_t and D_tt, representing first and second derivative filters in t respectively. The final averaging filters are the integral part of the flux tensor operator with A_x, A_y and A_t representing averaging filters in the x-, y- and t-directions respectively. The data flow shown in Figure 1 reflects optimizations for a sequential implementation. Specifically, the summation block is placed before the spatiotemporal averaging operators for improved computational efficiency but at the expense of increased task dependencies and reduced parallelism (see Eq. 5 discussion). Exchanging the order of the sum and averaging filters would increase parallelism but would require more memory or additional computation. For the implementation shown in Figure 1, calculating the flux tensor trace for each video pixel requires eight 1D convolutions for the three spatiotemporal derivatives and three 1D convolutions for local averaging filters within the corresponding spatiotemporal cubes. The number of temporal filtering operations is reduced by saving intermediate results using additional memory.
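
The cascade of eight derivative filters and three averaging filters described above can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: the 3-tap kernel coefficients, the clamped border handling, the nested-list volume layout `vol[t][y][x]` and the helper names (`filter_1d`, `flux_trace`, `make_volume`) are all hypothetical choices made here, not the authors' implementation.

```python
# Axis order for the nested-list volume: vol[t][y][x].
T_AXIS, Y_AXIS, X_AXIS = 0, 1, 2

# Illustrative 3-tap kernels (assumed values, not the paper's coefficients).
SMOOTH = [0.25, 0.5, 0.25]        # S: smoothing
DERIV = [-0.5, 0.0, 0.5]          # D: first derivative
DERIV2 = [1.0, -2.0, 1.0]         # D_tt: second derivative
AVERAGE = [1 / 3, 1 / 3, 1 / 3]   # A: box averaging


def filter_1d(vol, kernel, axis):
    """Apply a 1D kernel along one axis; borders are clamped (sketch choice)."""
    dims = (len(vol), len(vol[0]), len(vol[0][0]))
    half = len(kernel) // 2
    out = [[[0.0] * dims[2] for _ in range(dims[1])] for _ in range(dims[0])]
    for t in range(dims[0]):
        for y in range(dims[1]):
            for x in range(dims[2]):
                acc = 0.0
                for k, weight in enumerate(kernel):
                    pos = [t, y, x]
                    pos[axis] = min(max(pos[axis] + k - half, 0), dims[axis] - 1)
                    acc += weight * vol[pos[0]][pos[1]][pos[2]]
                out[t][y][x] = acc
    return out


def flux_trace(vol):
    """Flux tensor trace via eight derivative filters and three averaging filters."""
    # Two x-direction filters.
    i_sx = filter_1d(vol, SMOOTH, X_AXIS)
    i_dx = filter_1d(vol, DERIV, X_AXIS)
    # Three y-direction filters.
    i_dxsy = filter_1d(i_dx, SMOOTH, Y_AXIS)
    i_sxdy = filter_1d(i_sx, DERIV, Y_AXIS)
    i_sxsy = filter_1d(i_sx, SMOOTH, Y_AXIS)
    # Three temporal filters.
    i_xt = filter_1d(i_dxsy, DERIV, T_AXIS)
    i_yt = filter_1d(i_sxdy, DERIV, T_AXIS)
    i_tt = filter_1d(i_sxsy, DERIV2, T_AXIS)
    # Sum of squares first, then the three averaging filters.
    summed = [[[a * a + b * b + c * c for a, b, c in zip(ra, rb, rc)]
               for ra, rb, rc in zip(fa, fb, fc)]
              for fa, fb, fc in zip(i_xt, i_yt, i_tt)]
    for axis in (X_AXIS, Y_AXIS, T_AXIS):
        summed = filter_1d(summed, AVERAGE, axis)
    return summed


def make_volume(moving, size=5):
    """A bright pixel on row 2 that either stays at x=2 or moves with t."""
    vol = [[[0.0] * size for _ in range(size)] for _ in range(size)]
    for t in range(size):
        vol[t][2][t if moving else 2] = 1.0
    return vol


static_trace = flux_trace(make_volume(moving=False))
moving_trace = flux_trace(make_volume(moving=True))
```

In this toy volume the static pixel produces a zero response everywhere, while the moving pixel produces a positive response near its path, which is the behavior the trace operator is meant to capture.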

The filter lengths or tap sizes associated with the three kernels for computing the flux tensor trace, {n_Sx, n_Sy, n_Dx, n_Dy, n_Dt, n_Dtt, n_Ax, n_Ay, n_At}, are the full set of filter parameters that would need to be specified for a given application. Since we use spatially isotropic filters, we have a reduced set of parameters to specify, with n_Dx = n_Dy = n_Ds, n_Sx = n_Sy = n_Ss, n_Ax = n_Ay = n_As. Typically we use the same filter lengths for the first and second temporal derivative kernels (= n_Dt) and for the spatial smoothing and derivative kernels, i.e., n_Ss = n_Ds. Thus, there remain four main parameters of the flux tensor, (n_Ds, n_Dt, n_As, n_At), which are the 1D filter sizes of the spatial derivative filter, the temporal derivative filter, the spatial averaging filter, and the temporal averaging filter respectively. In medium to close view (indoor) shots, the choice of (5, 5, 5, 5) for filter sizes works well for detection. For the very far view sequences, where the objects may be quite small and moving very slowly, a (3, 9, 3, 3) size works best. The large temporal filter size helps to catch the slow motion, while the small spatial filter size helps to detect small motion and keeps the smoothing to a minimum.
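
The reduction from nine filter lengths to four parameters can be written down compactly. The class and method names below (`FluxTensorParams`, `full_filter_lengths`) are hypothetical and exist only to illustrate the isotropy assumptions stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FluxTensorParams:
    """The four reduced parameters (n_Ds, n_Dt, n_As, n_At)."""
    n_ds: int  # spatial derivative (and spatial smoothing) filter length
    n_dt: int  # temporal derivative filter length (first and second)
    n_as: int  # spatial averaging filter length
    n_at: int  # temporal averaging filter length

    def full_filter_lengths(self):
        """Expand to the nine per-filter lengths under the isotropy assumptions."""
        return {
            "n_Sx": self.n_ds, "n_Sy": self.n_ds,
            "n_Dx": self.n_ds, "n_Dy": self.n_ds,
            "n_Dt": self.n_dt, "n_Dtt": self.n_dt,
            "n_Ax": self.n_as, "n_Ay": self.n_as,
            "n_At": self.n_at,
        }


indoor = FluxTensorParams(5, 5, 5, 5)    # medium to close view setting
far_view = FluxTensorParams(3, 9, 3, 3)  # small, slowly moving objects
```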

2.2.  Sequential  Implementation  of  Flux  Tensor  Operator 

The flux tensor implementation uses just the luminance component of the RGB video (1920 x 1080 pixels). In our earlier work [10], we described a reference sequential implementation that has a minimum memory requirement and used just a single input image First In First Out (FIFO) buffer of size (n_Dt + n_At - 1) for storing the input frames but at the cost of recomputing all spatiotemporal derivatives and integrals for each new video frame. This can be a significant penalty in terms of time and power since many intermediate filtering results that could be reused have to be recomputed. A more efficient sequential implementation that minimizes redundant computations using one larger FIFO buffer of size 4 * (n_Dt + n_At - 1) for the intermediate spatial and temporal derivatives, and storing these intermediate results to be reused across temporal stages, is described in [11]. Here, we discuss a new alternative approach that further improves memory efficiency by using dual FIFO buffers with a smaller memory footprint of 3 * n_Dt + n_At plus a few additional frames. Each frame of the input sequence is first convolved with spatial derivative and smoothing filters. The intermediate results are stored as frames to be used in temporal convolutions, and pointers to these frames are stored in a FIFO buffer. The first FIFO structure is of length n_Dt and for each input frame three spatial derivative frames I_DxSy, I_SxDy and I_SxSy are calculated and stored. Hence, the number of frames that need to be stored in the first FIFO structure is 3 n_Dt. Once n_Dt frames are processed and stored, the FIFO structure has enough frames for calculation of the temporal derivatives. Three frames of storage are needed to hold the temporal derivatives in memory for the current timestep.

Fig. 1: Operator-centric data flow view of the various stages required to compute the flux tensor operator on a 3D spatiotemporal volume showing the task dependency relationships. Note that the magnitude squaring operators are explicitly shown. The summation stage is done prior to spatiotemporal averaging steps for memory efficiency at the cost of reduced parallelism.

Since averaging is distributive over addition for linear operators, the sum of squares I_xt^2 + I_yt^2 + I_tt^2, which is the trace of the flux tensor matrix, is computed first, then spatial averaging is applied to this result and stored in a second FIFO structure of size n_At, to be used in the temporal part of averaging. The numerical expression that is being computed is,

\mathrm{Tr}\,\mathbf{J}_F(\mathbf{x}) = \sum_{\mathbf{y} \in \mathcal{N}(x,y,t)} W(\mathbf{x} - \mathbf{y}) \left( I_{xt}^2(\mathbf{y}) + I_{yt}^2(\mathbf{y}) + I_{tt}^2(\mathbf{y}) \right)    (5)

where N is the local neighborhood over which the squares of the second derivatives are summed. A weighted averaging filter, such as a Gaussian, can be used at the expense of additional computing cost. Typically box filters are used for a power efficient implementation. This temporal averaging FIFO keeps n_At frames at a time and produces the flux tensor trace after it is full. Once both FIFOs are full, processing a new input frame causes a shift of pointers in both FIFOs, reusing intermediate results from previous calculations and reducing the total computation per flux tensor output frame.
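
The buffer sizes and the fill behavior of the two FIFOs can be illustrated with a small sketch. The function names are hypothetical, frame contents are replaced by integer time indices, and the dual-FIFO count ignores the "few additional frames" mentioned above.

```python
from collections import deque


def frames_single_fifo(n_dt, n_at):
    """Frame count for the single larger FIFO of the scheme in [11]."""
    return 4 * (n_dt + n_at - 1)


def frames_dual_fifo(n_dt, n_at):
    """Frame count for the dual FIFOs (excluding the few extra working frames)."""
    return 3 * n_dt + n_at


def stream(num_frames, n_dt, n_at):
    """Return the input indices at which a flux trace frame can be emitted."""
    fifo1 = deque(maxlen=n_dt)  # stands in for the spatially filtered frames
    fifo2 = deque(maxlen=n_at)  # stands in for spatially averaged sums of squares
    emitted = []
    for t in range(num_frames):
        fifo1.append(t)             # the oldest entry drops out automatically
        if len(fifo1) == n_dt:      # enough frames for the temporal derivatives
            fifo2.append(t)
            if len(fifo2) == n_at:  # enough frames for temporal averaging
                emitted.append(t)
    return emitted
```

With (n_Dt, n_At) = (5, 5), the first output appears once n_Dt + n_At - 1 = 9 input frames have arrived, and every later input frame yields one more output by shifting both FIFOs.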

3.  PARALLEL  IMPLEMENTATION  OF  THE  FLUX 
TENSOR  OPERATOR  ON  THE  CELL/B.E. 

A power efficient multicore implementation needs to take full advantage of the Cell/B.E. architecture-specific hardware accelerations in order to use the best choice of data and task partitioning across SPEs, manage memory transfers, and exploit vectorization and local buffers. The 3.2 GHz SPEs deliver their peak performance while executing a fused short vector multiply add (FMA) instruction on each clock cycle, which operates on a vector of four floating-point elements to complete eight floating point operations in SIMD fashion. Thus a peak performance of 25.6 single-precision GFLOP/s per SPE can be obtained. An important point to note is that the SPEs only work on data staged in their local memory (local store). However, the SPE local store is a limited resource as only 256 Kbytes is available for program, stack, local buffers and data structures. Making sure the SPEs efficiently receive and operate on current data without excessive buffering is critical to achieving high performance on the Cell/B.E. architecture. Rather than considering cache control and the impact of memory bandwidth, we focus on structuring data movement within the Cell/B.E. processor to keep the SPEs busy, and dividing the application into vectorized functions to make efficient use of the SPE hardware.
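
The per-SPE peak figure quoted above follows from simple arithmetic, sketched here. The function names are hypothetical, and the aggregate value for six SPEs is a theoretical upper bound implied by that arithmetic, not a measured result.

```python
def spe_peak_gflops(clock_ghz=3.2, simd_lanes=4, flops_per_fma=2):
    """One FMA per cycle on a 4-wide vector: a multiply and an add per lane."""
    return clock_ghz * simd_lanes * flops_per_fma


def aggregate_peak_gflops(active_spes=6):
    """Theoretical single-precision peak across the SPEs available on the PS-3."""
    return active_spes * spe_peak_gflops()
```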

We implemented and tested our code on the Sony PS-3, a very energy efficient multicore platform with program access to six of the eight Cell/B.E. SPE processors. The main/external XDR memory on the PPE is limited to 256 MB, of which about 200 MB is available to the Linux OS. To accommodate the required frame storage buffers and data structures for the parallel implementation, the data partitioning and grouping of the operations needed careful consideration. For main memory the parallel algorithm uses two FIFO buffers of size n_Dt and n_At for calculation of temporal derivatives and temporal averaging. In the sequential implementation, all the spatial derivatives are kept in the first FIFO buffer, which requires 3 n_Dt frames to be stored in main memory. Due to the limited shared (global) memory on the Cell/B.E., and taking advantage of the fast communication bus, the parallel implementation of the flux tensor is memory efficient and stores only the input sequence of images in the first FIFO buffer of size n_Dt (instead of 3 n_Dt for storing the spatial derivatives) and recalculates the spatiotemporal derivatives for each successive frame at the cost of extra calculations. We estimate the amount of redundant work to be between a factor of 2 and 5 compared to the sequential version, depending on the filter sizes shown in Table 1. There is no overlap in the work unit computation between adjacent SPEs, as shown by the vertical lines and faces marked in red in Figure 2, but there is an extra quad word data transfer (16 bytes) on the left and right sides to provide pixel padding in the x-convolution direction. The second buffer FIFO2 operates the same as in the sequential case.

Fig. 2: Data partitioning scheme showing Work Unit (WU) block width in pixels on SPEs. The 3D block represents the spatiotemporal 3D grid of input data that needs to be processed to produce one flux tensor output frame. The 3D grid of data is chunked uniformly to distribute to each SPE. Since the partition is too large to fit into the limited SPE memory, each work block is further divided into smaller work units. The WU sizes are dependent on the filter sizes as labeled.

In order to parallelize across SPEs, the data needs to be partitioned into equal work blocks amongst the different SPEs for optimal performance. Since there are convolutions in three dimensions (x, y, t), the whole work unit can be visualized as a 3D block. This is partitioned into as many overlapping blocks as there are active SPEs. The data is fetched and processed one row at a time. Due to the finite size of the local store memory, each SPE may further subdivide the work block into smaller chunks and process a work unit width of WU columns each time. The data partitioning scheme and full work unit block are illustrated in Figure 2. The execution process on the PPE and SPE side is summarized in Algorithms 1 and 2 respectively. Optimized convolution operators are represented using the ⊗ symbol, without showing the loop unrolling and optimized FMA operations explicitly.
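
A column-wise partition of one frame across the active SPEs can be sketched as below. The function name, the rule of rounding the block width up to a multiple of 16 pixels, and the use of a half-filter-width halo for the loaded range are assumptions made for this illustration; they are not taken from the authors' code, which pads by a quad word on each side.

```python
def partition_columns(frame_width, num_spes, halo, align=16):
    """Split columns into per-SPE work blocks.

    Returns tuples (start, width, load_lo, load_hi): each SPE computes the
    columns [start, start + width) and loads [load_lo, load_hi), which adds
    up to `halo` extra columns on each side for the x-direction filters.
    """
    block = -(-frame_width // num_spes)   # ceiling division
    block = -(-block // align) * align    # round up to the assumed alignment
    blocks = []
    start = 0
    while start < frame_width:
        width = min(block, frame_width - start)
        load_lo = max(start - halo, 0)
        load_hi = min(start + width + halo, frame_width)
        blocks.append((start, width, load_lo, load_hi))
        start += width
    return blocks


hd_blocks = partition_columns(1920, num_spes=6, halo=5 // 2)
sd_blocks = partition_columns(640, num_spes=6, halo=5 // 2)
```

Under these assumptions the HD frame splits into six blocks of 320 columns and the SD frame into blocks of at most 112 columns. The computed column ranges do not overlap, while the loaded ranges overlap by the halo.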

4.  RESULTS  AND  DISCUSSION 

The output of the flux tensor-based video object detection algorithm applied to a sample video sequence from the ARL Force Protection Surveillance System (FPSS) video collection [27] is shown in Figure 3. The first row shows color


Algorithm 1 Parallel Flux Tensor: PPE side

Input: Input image sequence I(x, y, t)
Output: Flux trace frame Tr J_F(x, y, t - ⌊n_Dt/2⌋ - ⌊n_At/2⌋)

Initialize number of intermediate flux frames, N_m ← 0
for each time t do
    Push(I(x, y, t), FIFO1)
    if FIFO1 contains n_Dt frames then
        Partition data into blocks.
        Put SPE control block information including work unit W and output location for intermediate flux F and final output Tr J_F.
        Set up SPE threads and wait for results F, Tr J_F.
        Push(F, FIFO2)
        N_m ← N_m + 1
        if N_m > n_At then
            Write output Tr J_F
        end if
    end if
end for


Algorithm 2 Parallel Flux Tensor: SPE i

Input: Images in FIFO1, N_m, starting column C_i, and work block width W.
Output: Blocks of intermediate flux into FIFO2 and flux trace Tr J_F

for each row r of Work Block do
    Load from FIFO1, pixel data I_r from column C_i - ⌊n_Ds/2⌋ up to C_i + W + ⌊n_Ds/2⌋ into local store
    Push(I_r ⊗ S_x, I_Sx_buffer); Push(I_r ⊗ D_x, I_Dx_buffer);
    if I_Sx_buffer and I_Dx_buffer have n_Ds rows then
        I_SxDy = I_Sx ⊗ D_y;
        I_SxSy = I_Sx ⊗ S_y; I_DxSy = I_Dx ⊗ S_y;
        I_yt = I_SxDy ⊗ D_t; I_tt = I_SxSy ⊗ D_tt; I_xt = I_DxSy ⊗ D_t;
        F_r = I_xt^2 + I_yt^2 + I_tt^2;
        Push(F_r, FIFO2); Pop(I_Sx_buffer); Pop(I_Dx_buffer);
    end if
end for
if N_m > n_At then
    for each row r of Work Block do
        Load from FIFO2, F_r data from column C_i - ⌊n_As/2⌋ up to C_i + W + ⌊n_As/2⌋ into local store
        Push(F_r ⊗ A_x, I_Ax_buffer);
        if I_Ax_buffer has n_As rows then
            I_AxAy = I_Ax ⊗ A_y;
            Tr J_F = I_AxAy ⊗ A_t;
            Pop(I_Ax_buffer);
        end if
    end for
end if

Fig. 3: Output of flux tensor motion estimation and blob extraction algorithm on selected frames of color visible and FLIR (forward looking long-wave infrared) data from the ARL FPSS dataset [27].


Table 1: Speed-up and performance to power ratios of the parallel multicore PS-3 Cell/B.E. implementation with 6 SPEs using 135 watts, compared to the optimized sequential implementation running on an Intel Xeon core in a Dell 1850 server using 550 watts. Sequential performance in frames per second is shown in the (T_1^seq)^-1 columns. Parallel PS-3 speed-up (S_6^PS-3) is relative to the sequential implementation running on an Intel Xeon CPU. The power efficiency improvement of the parallel implementation compared to the sequential implementation is shown in the PPR columns.

Filter Configuration    |           HD video              |           SD video
n_Ds  n_Dt  n_As  n_At  |  (T_1^seq)^-1  S_6^PS-3   PPR   |  (T_1^seq)^-1  S_6^PS-3   PPR
  3     3     3     3   |      1.75        40.1     164   |     17.32        18.2      74
  5     3     5     3   |      1.68        38.6     157   |     14.35        20.0      82
  7     3     7     3   |      1.54        38.1     155   |     13.02        19.3      79
  9     3     9     3   |      1.37        39.6     161   |     11.52        19.6      80
  3     5     3     5   |      1.53        30.8     125   |     13.99        14.7      60
  5     5     5     5   |      1.42        29.6     120   |     12.28        14.9      61
  7     5     7     5   |      1.34        28.7     117   |     10.99        15.2      62
  9     5     9     5   |      1.23        28.3     115   |     10.05        14.8      60
  3     7     3     7   |      1.34        25.6     104   |     12.01        13.5      55
  5     7     5     7   |      1.25        24.7     101   |     10.44        13.2      54
  7     7     7     7   |      1.18        23.6      96   |      9.88        11.9      48
  9     7     9     7   |      1.11        14.2      58   |      9.03        12.0      49
  3     9     3     9   |      1.24        22.0      89   |     10.45        12.0      49
  5     9     5     9   |      1.14        21.3      87   |      9.34        11.3      46
  7     9     7     9   |      1.09        12.4      51   |      8.63        11.1      45
  9     9     9     9   |      1.02        13.0      53   |      7.97        10.7      43


and long-wave infrared frames from the original video sequence. The second and third rows show the grayscale flux tensor response and the thresholded binary masks respectively, using the flux tensor motion analysis with (5, 5, 5, 5) filters, followed by grayscale closing (circular structuring element of radius 5) and histogram based thresholding, adaptively switching between global Otsu and 80% cumulative histogram value. The colored blobs show the detected moving objects after post processing steps including morphological noise removal using area opening and connected component labeling to identify contiguous regions. The pink blobs are associated with two people walking in the far background. The FLIR channel is not affected by shadows and produces more compact blobs of moving objects suitable for tracking.

The sequential code was tested on a Dell PowerEdge 1850 server running CentOS Linux 5.4 using a single core of a dual CPU dual core Intel Xeon 2.8 GHz with 2 MB of cache per core, 4 GB of memory and an 800 MHz front side bus, compiled using gcc -O3 version 4.1.2. The parallel code was tested on a PS-3 Cell/B.E. with 6 SPEs using an appropriate SPE work unit for HD sized (1920 x 1080, WU=320 or smaller) and Standard Definition (SD) sized (640 x 480, WU=112 pixels) images. The PS-3 uses about 135 watts while the Dell PowerEdge 1850 uses about 550 watts for system operation including CPU, peripheral devices, operating system, multitasking, etc. Total system power was used to measure the performance to power efficiency ratios without doing detailed power measurements that can become complex to instrument and compare. The work unit size that can be accommodated by one SPE with 256 KB of local store depends on the size of the 3D convolution filters, especially the temporal filters, and on data alignment and partitioning requirements (usually multiples of 16 bytes). Parallel performance benchmarking done on IBM QS20 and QS22 Blade servers with dual Cell/B.E.s, both running Fedora Linux and compiled using gcc -O3 version 4.1.2, will be reported elsewhere.

We compared the performance of the sequential and parallel implementations of the flux tensor for different filter configurations on two different video frame sizes using 3D grids. Speed-up using $p$ processors was calculated as $S_p^{PS-3} = T_1^{Seq} / T_p^{PS-3}$, where $T_p^{PS-3}$ is the average time measured across $p$ processors to complete the flux tensor computation for one frame and $T_1^{Seq}$ is the time taken by the single-core sequential implementation; $p = 6$ SPEs on the PS-3. The performance to power efficiency improvement ratio was calculated as $PPR = (P_{Xeon} / P_{PS-3}) \, S_p^{PS-3}$, where $P_A$ is the system power used by architecture $A$ and $S_p^{PS-3}$ is the speed-up ratio.
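
The two ratios can be checked against the reported numbers with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and it assumes that PPR is the speed-up scaled by the ratio of the sequential system power (about 550 W) to the PS-3 system power (about 135 W), an interpretation that reproduces the PPR column of Table 1 to within rounding.

```python
def speed_up(t_seq: float, t_par: float) -> float:
    """Speed-up S_p = T_seq / T_par for one frame (times in seconds)."""
    return t_seq / t_par


def ppr(s: float, p_seq_watts: float, p_par_watts: float) -> float:
    """Performance-to-power improvement ratio (assumed form):
    the speed-up scaled by the ratio of total system powers."""
    return s * p_seq_watts / p_par_watts


# Values taken from the text and Table 1 (HD, all filter sizes equal to 3).
seq_fps = 1.75            # sequential frame rate on the Xeon, frames/s
s_hd = 40.1               # reported speed-up on the PS-3
par_fps = seq_fps * s_hd  # implied parallel frame rate, about 70 frames/s

# Recover the speed-up from the per-frame times as a consistency check.
s_check = speed_up(1.0 / seq_fps, 1.0 / par_fps)

# About 163.4, close to the tabulated value of 164.
hd_ppr = ppr(s_hd, p_seq_watts=550.0, p_par_watts=135.0)

print(f"speed-up {s_check:.1f}, parallel rate {par_fps:.1f} fr/s, PPR {hd_ppr:.1f}")
```

Because the power ratio 550/135 is roughly 4.07, every PPR entry is about four times the corresponding speed-up.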

Table 1 shows the range of spatial and temporal sizes for both derivative and integral/averaging filters, varying between 3, 5, 7 and 9, that were used for performance benchmarking. The sequential frame rate, or inverse of the time to compute one frame on the Intel Xeon processor, is given in column $(T_1^{Seq})^{-1}$ in frames per second; the speed-up measured on the PS-3 Cell/B.E. platform compared to the sequential performance is given in column $S_6^{PS-3}$. For the smallest derivative and integral filters of size 3 the speed-up of the parallel implementation compared to the sequential performance was more than a factor of 40 even though there are only six computational cores. The super-linear speed-up behavior is due to the use of extensive vectorization, loop unrolling and FMA operations to implement the flux tensor convolution kernels, despite the additional work done by the SPEs recomputing intermediate results in the parallel implementation compared to the sequential version. As the filter sizes increase the speed-up gain decreases, because the total volume of computations increases faster on the Cell (linearly with the size of the filter): the parallel implementation needs to recompute all of the intermediate spatial derivatives, whereas the sequential implementation is able to store and reuse intermediate results at the cost of extra memory. For larger filter sizes the limited local store on the SPEs requires smaller data chunks (work unit width in Figure 2), which in turn require more than six threads of execution and result in two stages of computation, reducing speed-up; the filter configurations needing two stages of execution are shown in bold font in Table 1. This can be partly mitigated by using the more expensive IBM Cell/B.E. Blade processors, which have much more main memory but also higher power consumption.
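
The effect of the work unit width on the number of execution stages can be illustrated with a small calculation. This is a simplified sketch, not the scheduling code of the implementation: it assumes the frame is split into vertical strips of width WU with one strip per SPE thread, it ignores any overlap between neighboring strips, and the 256-pixel width is a hypothetical value chosen only to show the two-stage case.

```python
import math

NUM_SPES = 6  # SPEs available on the PS-3


def work_units(frame_width: int, wu_width: int) -> int:
    """Number of strips needed to cover the frame width."""
    return math.ceil(frame_width / wu_width)


def stages(frame_width: int, wu_width: int, num_spes: int = NUM_SPES) -> int:
    """Number of passes needed when each SPE handles one strip per pass."""
    return math.ceil(work_units(frame_width, wu_width) / num_spes)


hd_units = work_units(1920, 320)       # 6 strips, one per SPE
hd_stages = stages(1920, 320)          # a single stage
sd_units = work_units(640, 112)        # 6 strips (the last one is partial)
sd_stages = stages(640, 112)           # a single stage
hd_small_units = work_units(1920, 256) # hypothetical narrower strip: 8 strips
hd_small_stages = stages(1920, 256)    # more strips than SPEs: two stages

print(hd_units, hd_stages, sd_units, sd_stages, hd_small_units, hd_small_stages)
```

Under these assumptions, any work unit narrower than 320 pixels on an HD frame yields more than six strips, so the second stage runs with some SPEs idle, which is in line with the sharp drop in HD speed-up for the largest filter configurations.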

The speed-up of the multicore flux tensor implementation ranged from a factor of 11 to 20 for the smaller SD video frame sizes to between 12 and 40 for the larger HD frame size video streams, as shown in Table 1. The results for 16 different filter configurations for both HD and SD video frame sizes are compared and show substantial improvement in both parallel speed-up and performance to power efficiency. Our implementation on the PS-3 Cell/B.E. was able to deliver 34.9 fr/s × 1.32 GFLOP/frame = 46 GFLOP/s, which is 30% of single-precision peak performance (153.6 GFLOP/s) but significantly better than the expected memory-intensive peak of 12.8 GFLOP/s. The identical code reaches 39%, 35% and 24% of peak performance on the QS20 (410 GFLOP/s for 16 SPEs) with 6, 8 and 16 SPEs respectively for the same filter size configuration of $n_{Ds} = 9$, $n_{Dt} = 5$, $n_{As} = 9$, $n_{At} = 5$. An earlier implementation that used better data alignments but could only handle a limited number of filter sizes and image widths was able to reach 68 fr/s and 58% of peak on the QS20. We found that explicit memory management and some assembly coding on the Cell/B.E. are required to reach high performance, even though this hand tuning incurs additional programming effort. Multicore GPU architectures are also well suited for computer vision algorithms [28]. In future work we will compare the power efficiency of the flux tensor on GPUs using CUDA or OpenCL.
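
The sustained throughput figures quoted above follow from simple arithmetic, sketched below. The variable names are illustrative; the per-SPE peak of 25.6 GFLOP/s is derived here from the quoted PS-3 peak divided by six SPEs, and the Table 1 cross-check assumes the parallel frame rate equals the sequential frame rate times the speed-up.

```python
PS3_PEAK_GFLOPS = 153.6   # single-precision peak for 6 SPEs (from the text)
GFLOP_PER_FRAME = 1.32    # work per HD frame for the (9, 5, 9, 5) filters
PARALLEL_FPS = 34.9       # measured PS-3 frame rate (from the text)

# Sustained throughput and its fraction of peak.
sustained = PARALLEL_FPS * GFLOP_PER_FRAME   # about 46.1 GFLOP/s
fraction_of_peak = sustained / PS3_PEAK_GFLOPS  # about 0.30

# Cross-check with Table 1: sequential rate times speed-up for (9, 5, 9, 5).
table_fps = 1.23 * 28.3                      # about 34.8 fr/s

# Peak per SPE and the corresponding 16-SPE QS20 peak.
per_spe_peak = PS3_PEAK_GFLOPS / 6           # 25.6 GFLOP/s
qs20_peak = per_spe_peak * 16                # about 409.6, quoted as 410

print(f"{sustained:.1f} GFLOP/s, {fraction_of_peak:.0%} of peak, "
      f"table rate {table_fps:.1f} fr/s, QS20 peak {qs20_peak:.1f} GFLOP/s")
```

The small gap between 34.8 and 34.9 fr/s is consistent with rounding of the tabulated frame rate and speed-up.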


5.  CONCLUSIONS 

The flux tensor operator estimates significant orientations in multidimensional gridded datasets and is an efficient technique for moving object detection in video datasets. The parallel algorithm implemented for the Cell/B.E. PS-3 architecture with six SPE computational cores achieved a speed-up factor of 40 compared to the sequential algorithm using the smallest filter sizes, which offers substantial energy efficiency optimization choices. The super-linear speed-up behavior is due to the extensive use of vectorized floating point operations, FMA instructions and double buffering to overlap computation and communication. With larger filter sizes the speed-up gain decreased to a factor of 12, since the total volume of computations increases faster on the Cell/B.E.: the parallel implementation must recompute the intermediate spatial derivatives, whereas the sequential implementation stores them and uses significantly more memory. For larger filter sizes the limited local store on the SPEs also leads to smaller data partitioning sizes that require more than six threads of execution, which results in two stages of computation and thus reduces speed-up. For all filter sizes tested the parallel flux tensor algorithm was able to exceed realtime performance requirements using a single PS-3 Cell/B.E. processor for SD sized video streams, and for most of the filter sizes for HD sized video streams. The lower power requirements of the multicore PS-3 Cell/B.E. compared to an Intel Xeon processor make the performance to power ratio of the flux tensor more than 160 times better for the smaller filter sizes and more than 50 times better for the larger filter sizes when processing HD video streams. The dependency of the energy efficiency factor on the flux tensor filter size provides an additional dimension for energy optimization based on output image quality, for example by using slightly smaller filter sizes at the cost of a marginal reduction in performance.